Section: New Results

Hardware Arithmetic and Architecture

Participants : Florent de Dinechin, Hong Diep Nguyen, Bogdan Pasca, Honoré Takeugming, Álvaro Vázquez Álvarez, Nicolas Brunie, Sylvain Collange.

FPGA-specific arithmetic

Reconfigurable computing offers the opportunity to use exotic operators that would not make sense in a general-purpose microprocessor [43], for instance the constant dividers studied in 6.1.2. Such operators must also be matched to the precision and performance required by applications. F. de Dinechin and B. Pasca described the FloPoCo framework, which assists the construction of correct pipelines and the automatic testing of such operators [28]. In this context, B. Pasca, with H. D. Nguyen, now at U.C. Berkeley, and T. Preusser, from T. U. Darmstadt, described improved architectures for short-latency adders on modern FPGAs [39]. With Ch. Alias and A. Plesco (Compsys project-team), he studied the integration of deeply pipelined arithmetic datapaths into high-level synthesis tools [51].
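
As an illustration of the short-latency adder work in [39], the following C model (not code from the paper; the function name and the 16-bit segment width are illustrative assumptions) shows the classical carry-select decomposition that such adder architectures build on: each segment of a wide addition is computed speculatively for both possible incoming carries, and the correct result is selected once the carry is known.

    #include <stdint.h>

    /* Illustrative carry-select model of a wide adder split into 16-bit
       segments: each segment is computed for carry-in 0 and 1, and the
       correct sum is selected as the carry propagates across segments. */
    uint64_t carry_select_add64(uint64_t a, uint64_t b)
    {
        uint64_t result = 0;
        unsigned carry = 0;
        for (int seg = 0; seg < 4; seg++) {          /* four 16-bit segments */
            uint32_t as = (a >> (16 * seg)) & 0xFFFF;
            uint32_t bs = (b >> (16 * seg)) & 0xFFFF;
            uint32_t sum0 = as + bs;                 /* speculative: carry-in 0 */
            uint32_t sum1 = as + bs + 1;             /* speculative: carry-in 1 */
            uint32_t sum  = carry ? sum1 : sum0;     /* late selection */
            result |= (uint64_t)(sum & 0xFFFF) << (16 * seg);
            carry = sum >> 16;                       /* carry out of this segment */
        }
        return result;
    }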

Multiplication by Rational Constants versus Division by a Constant

Motivated by the divisions by 3 and by 9 appearing in some stencil kernels, F. de Dinechin investigated how the periodicity of the binary representation of a rational constant can be exploited to design an architecture that multiplies by this constant [26]. With L. S. Didier, he then compared this approach to a specialisation of divider architectures to division by small integer constants, which is shown to match the fine-grained structure of FPGAs well [44].
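
To make the periodicity argument concrete, here is a hedged C sketch (not taken from [26] or [44]): dividing a 32-bit unsigned integer by 3 amounts to multiplying by a fixed-point approximation of the periodic binary expansion of 1/3, and the repeating 2-bit pattern lets that product be assembled with a logarithmic number of shift-and-add steps, which is the structure a constant-multiplier architecture can exploit.

    #include <stdint.h>

    /* Divide a 32-bit unsigned integer by 3 using the periodic binary
       expansion of 1/3 (0.0101...).  The product x * 0x55555555 is built
       by replicating the 2-bit period with 4 shift-and-add steps instead
       of a generic 32x32 multiplication. */
    uint32_t div3(uint32_t x)
    {
        uint64_t t = (uint64_t)x + ((uint64_t)x << 2);  /* x * 0x5        */
        t += t << 4;                                    /* x * 0x55       */
        t += t << 8;                                    /* x * 0x5555     */
        t += t << 16;                                   /* x * 0x55555555 */
        /* x * 0xAAAAAAAB = 2 * (x * 0x55555555) + x, and
           floor(x * 0xAAAAAAAB / 2^33) = floor(x / 3) for all 32-bit x. */
        return (uint32_t)(((t << 1) + x) >> 33);
    }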

Elementary Functions

A. Vázquez worked with J. Bruguera, from U. Santiago de Compostela, on hardware architectures for evaluating q-th roots [66]. Their solution composes digit-recurrence operators for reciprocal, logarithm, multiplication and exponential.
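
The composition of these four operators follows the identity x^(1/q) = exp(ln(x)/q). The double-precision C sketch below only illustrates this dataflow; the architecture of [66] chains dedicated digit-recurrence units rather than library calls.

    #include <math.h>

    /* Software analogue of the q-th root composition x^(1/q) = exp(ln(x)/q):
       a reciprocal, a logarithm, a multiplication and an exponential,
       evaluated here in double precision for x > 0. */
    double qth_root(double x, unsigned q)
    {
        double r = 1.0 / (double)q;   /* reciprocal     */
        double l = log(x);            /* logarithm      */
        double p = r * l;             /* multiplication */
        return exp(p);                /* exponential    */
    }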

Extensions of the fused-multiply-and-add operator

With B. de Dinechin, from Kalray, N. Brunie and F. de Dinechin proposed to extend the classical fused-multiply-and-add operator with a larger addend and result. This enables higher-precision computation of sums of products at a cost that remains close to that of the classical FMA [56].
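
A software analogue of what a wider addend and result buy is compensated summation of products: the standard fma() recovers the rounding error of each product exactly, as in the hedged C sketch below. This is a simplified illustration only (it compensates product errors but not addition errors), not the operator proposed in [56], which keeps the extra precision inside a single instruction.

    #include <math.h>

    /* Compensated dot product: fma(a, b, -p) returns the exact rounding
       error of the product p = a*b, and these errors are accumulated
       separately (addition errors are ignored in this simplified sketch). */
    double dot_compensated(const double *a, const double *b, int n)
    {
        double sum = 0.0, corr = 0.0;
        for (int i = 0; i < n; i++) {
            double p   = a[i] * b[i];
            double err = fma(a[i], b[i], -p);  /* exact error of the product */
            sum  += p;
            corr += err;
        }
        return sum + corr;
    }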

Emerging throughput-oriented architecture

On massively multi-threaded processors such as GPUs, neighboring threads are likely to operate on similar data. S. Collange showed with A. Kouyoumdjian how this inter-thread value correlation can be exploited at the hardware level with a cache-compression technique for GPUs [59]. With D. Sampaio, R. Martins, and F. Magno Quintão Pereira (U. Minas Gerais), he then addressed the same question at the compiler level, using a compiler stage that statically identifies such data patterns in GPGPU programs [65].
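
The patterns of interest are typically uniform values (identical across a warp) and affine values (an arithmetic progression in the thread index, such as consecutive addresses). The C sketch below is an illustrative model rather than code from [59] or [65]; it classifies the values held by a warp accordingly.

    #include <stdint.h>

    enum pattern { UNIFORM, AFFINE, GENERIC };

    /* Classify the values held by the threads of one warp: all identical
       (uniform), an arithmetic progression in the thread index (affine),
       or anything else (generic).  Assumes warp_size >= 2.  Hardware
       compression and static analysis can both exploit the first two cases. */
    enum pattern classify_warp(const int32_t v[], int warp_size)
    {
        int32_t stride = v[1] - v[0];
        for (int i = 2; i < warp_size; i++)
            if (v[i] - v[i - 1] != stride)
                return GENERIC;
        return stride == 0 ? UNIFORM : AFFINE;
    }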

Current GPU architectures require specific instruction sets with control-flow reconvergence annotations, and only support a limited number of control-flow constructs. S. Collange and N. Brunie, with G. Diamos (NVIDIA), generalized dynamic vectorization to arbitrary control flow on standard instruction sets with no compiler involvement [46], [57], [54]. In addition, this technique allows divergent branches to be executed in parallel, as a way to increase the throughput of parallel architectures [55].
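
As a rough illustration of annotation-free dynamic vectorization, the C sketch below models one scheduling step of a SIMT execution where each thread keeps its own program counter and, under one possible policy, the threads at the smallest active PC execute together; reconvergence then emerges from the scheduling policy rather than from compiler-inserted annotations. The step() callback, the min-PC policy and the negative-PC termination convention are illustrative assumptions, not the exact mechanism of [46], [57], [54].

    #include <stdbool.h>

    #define NTHREADS 32

    typedef int (*step_fn)(int tid, int pc);   /* executes one instruction,
                                                  returns the thread's next PC */

    /* One scheduling step of a simplified SIMT model with per-thread
       program counters: pick the smallest PC among live threads and run
       every thread currently at that PC.  A negative PC marks a
       terminated thread. */
    void simt_step(int pc[NTHREADS], bool done[NTHREADS], step_fn step)
    {
        int min_pc = -1;
        for (int t = 0; t < NTHREADS; t++)
            if (!done[t] && (min_pc < 0 || pc[t] < min_pc))
                min_pc = pc[t];
        if (min_pc < 0)
            return;                            /* all threads finished */

        for (int t = 0; t < NTHREADS; t++)
            if (!done[t] && pc[t] == min_pc) {
                pc[t] = step(t, min_pc);
                if (pc[t] < 0)
                    done[t] = true;
            }
    }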